Dimensionality is the great challenge for effective outlier detection because of the “curse of dimensionality”. In a high-dimensional
data space it is difficult to tell the most related points from the most unrelated points. Outliers are the most unrelated points, yet in
high-dimensional data every point tends to look like a good outlier, which makes them hard to identify. In this paper we propose a
method based on Reverse Nearest Neighbors (RNN). Points that occur most frequently in the KNN lists, known as hub points, and
points that occur rarely, known as antihub points, are identified. In addition, we show the relationship between outliers and antihub
points by means of a correlation factor. Unsupervised learning helps to find clear outliers; throughout this paper we work with both
synthetic data and real data. Based on our analysis, the RNN-distance-based similarity gives a higher percentile score for detecting
outliers than the basic KNN approach and the ABOD method. An ID3 approach built on this distance-based method is also applied
to improve outlier detection further.
1. INTRODUCTION


An outlier is a point at an extreme level, i.e. a point that behaves abnormally. Outlier detection is a form of anomaly detection
concerned with such extreme points. It is widely applied in practice even though there is no single formal mathematical definition of
an outlier. Outlier detection can be supervised, semi-supervised or unsupervised. The unsupervised method is the most widely
applied because labels are not available for all data points [1]. Unsupervised methods concentrate mainly on distance, which is
affected by the “curse of dimensionality” [5][11]: as dimensionality increases, all points in the data seem to be good outliers [8].
Hence distance concentration is considered here when dealing with high-dimensional data. It is difficult to understand how an
increase in dimension impacts outlier detection. Outliers are noisy data that let unwanted data points in, so it is necessary to avoid
them; unsupervised detection plays a wide role in medical research, military applications, etc. The “curse of dimensionality” refers
to the accepted fact that all points appear to be almost equally good outliers in high-dimensional data.
Reverse nearest neighbors are proposed in this paper to identify clearly the points that behave as outliers. The technique uses the
hubness phenomenon [6]. Since anomaly detection helps to identify intruders, it is necessary to separate the related points from the
unrelated points; no unrelated point should be counted among the related ones. The hubness phenomenon helps to look closely at
the points that occur often and the points that occur rarely, from both a local and a global point of view [9][10]. It affects the
reverse nearest neighbor counts, i.e. the k-occurrences: the number of times a point x occurs among the k nearest neighbors of other
points. Points that occur frequently in the KNN lists are called hub points, and points that occur rarely are called antihub points;
the latter are referred to as infrequent members. Antihub points help in detecting outliers through a correlation factor [7]. We
consider here how antihub points emerge and how they lead to the detection of outliers. In low-dimensional data the antihub points
are easy to identify; in high-dimensional data they are identified by increasing the neighborhood size to almost the size of the data.
This increase in neighborhood size leads to effective detection of outliers and avoids noisy data. The ODIN method [2] was
considered on the basis of the relation between outliers and antihubs in both high-dimensional and low-dimensional data.

The k-occurrences are not regular in all cases; they differ with the dimensionality. This is well suited to low-dimensional data but
not to high-dimensional data [12][13]. Hence the reverse nearest neighbor technique is used.


2. RELATED WORK 


In this section, we describe the existing work related to outlier detection. Chandola, Banerjee and Kumar [1] presented a survey of
anomaly detection and showed that unsupervised learning is applied most widely because little labeled data is available. As the
internet grows and large amounts of unlabelled data are uploaded daily, the scope for label-based techniques is reduced. They also
reported that time complexity is reduced by using the unsupervised technique.

Hautamaki, Karkkainen and Franti [2] proposed a method of outlier detection based on the k-nearest neighbor graph. The method
constructs the KNN graph and flags as anomalies the nodes with a low in-degree, i.e. the nodes that are a nearest neighbor of few
other nodes. The threshold is fixed based on an analysis of all the nodes, and the detection of outliers is almost accurate with this
method.


Aggarwal and Yu [3] describe a method of outlier detection for high-dimensional data and show how the increase in dimension
impacts outlier detection. As the dimensionality increases, the outlier points are skewed in each case [15]. The paper surveys how
distance behaves in high-dimensional data under unsupervised learning. Skewness is measured for the outlier score so that the
variation between low-dimensional and high-dimensional data can be seen clearly.

Kriegel, Schubert and Zimek [4] proposed a method called Angle-Based Outlier Detection (ABOD), in which all data points are
considered as nodes and, based on a grid format, an angle threshold is fixed to find the closest neighborhood points among all the
data points. This threshold differs for each dataset, mainly depending on the dimensionality.

Zimek, Kriegel and Schubert [5] also presented a survey of how unsupervised learning helps to find outliers in high-dimensional
data. As the internet grows and large amounts of data are uploaded without class labels, it is difficult to label all the data points.
Clustering techniques help to find anomalies in unlabelled data.

The related work illustrates that dealing with high-dimensional data is the most challenging factor in detecting outliers. The
KNN and ODIN methods help to develop the proposed work for finding outlier points effectively. The analysis shows that the reverse
nearest neighbor technique is necessary for unsupervised learning. The accuracy of outlier detection is high, and real datasets are
used here to demonstrate it. The skewness of the outlier score is also measured for both high-dimensional and low-dimensional
data.


Fig 1: System Description (figure not reproduced; recoverable stage label: Pre-Processing)


3. SYSTEM DESCRIPTION 


The outlier detection is done by the following methodologies for many real datasets.
• Euclidean distance
• Calculating Nk(x)
• Outlier score and performance analysis


Figure 1 gives an overview of the proposed system. Pre-processing eliminates missing values and makes the data ready for the
proposed work. Distances are calculated with the Euclidean distance by default; they are used to find the neighbors among the data
points and are the input to the KNN classification. The KNN classification produces the list of k nearest neighbors for every point in
the dataset. Within these lists some points occur rarely and some occur frequently; they are known as antihub points and hub
points respectively. Finally the outlier score is calculated and the outlier points are identified.


4. IMPLEMENTATION 
This section briefly discusses the implementation methodology of the proposed work. The proposed system is implemented with the
following modules:

o Euclidean distance
o KNN classification
o Calculating Nk(x)
o Determining Outlier Score


A. Euclidean Distance 
The distance between two points of the xy-plane can be found using the distance formula. The distance between (x1, y1) and (x2, y2)
is given by:

$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$   (1)


Similarly, for high-dimensional data the distance value is calculated by considering all the dimensions.
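As a minimal sketch of equation (1) extended to any number of dimensions, the distance could be computed as shown below. The function names `euclidean_distance` and `pairwise_distances` are illustrative assumptions and are not taken from the paper.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared coordinate differences over every dimension, then take the root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pairwise_distances(points):
    """Full n x n distance matrix for a list of points (hypothetical helper)."""
    return [[euclidean_distance(p, q) for q in points] for p in points]
```

For example, `euclidean_distance((0, 0), (3, 4))` evaluates equation (1) for the two-dimensional case, and the same call works unchanged for points with more coordinates.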


B. KNN classification 
Given a point x, the k points most similar to x, i.e. its k nearest neighbors, are found. They are determined from the distance
calculated in equation 1: the points at the smallest distance from x are taken as its neighbors, and the k nearest neighbors of every
other point are calculated in the same way.


Classification steps:

Step 1: Find the class label of each of the k neighbors.
Step 2: Use a voting or weighted voting approach to determine the majority class among the neighbors.
Step 3: Weighted voting means that the closest neighbors count more.
Step 4: Assign the majority class label to x.


This classification results in the list of k nearest neighbors for all points in the dataset, which helps in finding the hub points and
antihub points.
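The neighbor search and the voting step can be sketched as follows. This is only an illustration under stated assumptions: the names `squared_distance`, `knn_indices` and `majority_label` are hypothetical, ties in distance are broken by the lower index, and squared distances are used for ranking because they give the same order as the distances of equation 1.

```python
from collections import Counter


def squared_distance(p, q):
    """Squared Euclidean distance; ranks neighbors in the same order as equation 1."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def knn_indices(points, k):
    """For every point, return the indices of its k nearest other points."""
    n = len(points)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of points")
    lists = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Assumption: equal distances are ordered by index to keep the result deterministic.
        others.sort(key=lambda j: (squared_distance(points[i], points[j]), j))
        lists.append(others[:k])
    return lists


def majority_label(neighbor_labels):
    """Unweighted vote: the most common class label among the neighbors."""
    return Counter(neighbor_labels).most_common(1)[0][0]
```

Only the lists returned by `knn_indices` are needed for the hub and antihub analysis; `majority_label` illustrates the voting step for the case where class labels are available.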


C. Calculating Nk(x)
Let $D \subset \mathbb{R}^d$ be a finite set of n points. For a point $x \in D$ and a given distance or similarity measure, the number of k-occurrences,
denoted $N_k(x)$, is the number of times x occurs among the k nearest neighbors of all other points in D.

For $q \in (0,1)$, hubs are the $\lceil nq \rceil$ points $x \in D$ with the highest values of $N_k(x)$.

For $q \in (0,1)$, antihubs are the $\lceil nq \rceil$ points $x \in D$ with the lowest values of $N_k(x)$.


In the lists of k nearest neighbors of all points in the dataset, the number of occurrences of each point is counted. Taking the
dimensionality into account, the value of k is set almost to the size of the data; otherwise there will be very few hub points and
many antihub points. In high dimensions every data point is far away from the mean of the data, which makes the outlier points
difficult to find. If a constant threshold value is used, every data point will erroneously be classified as an outlier. However,
statistical theory can be used to define a reliable rule for detecting outliers regardless of the dimension of the data. This is one of the
great challenges in the proposed work. The threshold value alone decides which points are antihubs and hence which are outliers,
so it is fixed very carefully. A varying threshold thus helps to find outliers more accurately.
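Counting the k-occurrences and selecting hubs and antihubs can be sketched as below. The sketch assumes that the KNN lists are given as lists of indices, and that a fraction q of the points is taken at each end of the ranking; the function names and the tie-breaking by index are assumptions made for illustration.

```python
import math


def k_occurrences(knn_lists):
    """Count how often each index appears in the KNN lists of the other points."""
    counts = [0] * len(knn_lists)
    for neighbors in knn_lists:
        for j in neighbors:
            counts[j] += 1
    return counts


def hubs_and_antihubs(counts, q):
    """Return (hubs, antihubs): the ceil(n*q) points with the highest and lowest counts."""
    n = len(counts)
    m = math.ceil(n * q)
    # Rank indices by count; ties are ordered by index (an assumption).
    order = sorted(range(n), key=lambda i: (counts[i], i))
    antihubs = order[:m]
    hubs = order[n - m:]
    return hubs, antihubs
```

With KNN lists `[[1, 2], [0, 2], [0, 1], [1, 2]]`, the point with index 3 never appears in any list, so its count is zero and it is the first antihub candidate.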


D. Determining Outlier Score 


The outlier score is determined by the following proposed algorithm. 


Algorithm 1: AntiHub_dist

Step 1: $t := N_k(x_i)$, computed with respect to the distance   (2)
Step 2: $s_i := f(t)$   (3)

where $f(t) = 1/(1 + t)$, i.e. $1/(1 + N_k(x_i))$, is a monotone function. Two factors contribute to the disadvantage of the AntiHub
algorithm: the hubness property and the inherent property. To add more discrimination, one approach could be to raise k, possibly
to some value comparable with n.


But this approach raises two concerns:
• With increasing k the notion of outlierness moves from local to global, so local outliers, when they are the ones of interest, can
be missed by this methodology.
• Values of k comparable with n raise issues of computational complexity.
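A minimal sketch of Algorithm 1 is given below, assuming points are tuples of numbers and that ties in distance are broken by index. The function name `antihub_scores` is hypothetical; the score is the monotone function 1 / (1 + Nk(x)) from the text, so points that rarely appear in KNN lists receive scores close to 1.

```python
def squared_distance(p, q):
    """Squared Euclidean distance, sufficient for ranking neighbors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def antihub_scores(points, k):
    """Outlier score 1 / (1 + N_k(x_i)) for every point (sketch of Algorithm 1)."""
    n = len(points)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of points")
    counts = [0] * n
    for i in range(n):
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: (squared_distance(points[i], points[j]), j),
        )
        # Step 1: every point in the KNN list of x_i gains one k-occurrence.
        for j in ranked[:k]:
            counts[j] += 1
    # Step 2: apply the monotone function f(t) = 1 / (1 + t).
    return [1.0 / (1 + c) for c in counts]
```

Because every point is compared with every other point, this sketch needs on the order of n squared distance computations, which illustrates the computational concern raised above for large n and k.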


Algorithm 2: AntiHub2
AntiHub2 refines the outlier scores produced by the AntiHub method by also considering the Nk scores of the neighbors of x, in
addition to Nk(x) itself. For each point x, AntiHub2 proportionally adds (1 - α)·Nk(x) to α times the sum of the Nk scores of the
k nearest neighbors of x, where α ∈ [0, 1].

Step 1: $z := \mathrm{AntiHub}_{dist}$
This is the outlier score of the previous step.

Step 2: $y_i := \sum_{j \in N_{dist}(k,i)} z_j$   (4)
where $N_{dist}(k,i)$ is the set of indices of the k nearest neighbors of $x_i$; initialize temp := 0.

Step 3: $S_i := (1 - \alpha)\, z_i + \alpha\, y_i$   (5)
Here the value of α is chosen between 0 and 1; steps 3 to 5 are repeated for each candidate α.

Step 4: Calculation of discScore
$dis := \mathrm{discScore}(S, p)$   (6)
This function returns the value u/(np), where u is the number of unique values among the np smallest members of S and n is the
number of data points.

Step 5: If dis > temp, then t is set to S and temp is set to dis.
Step 6: Outlier score calculation: $s_i := f(t_i)$

where $f(t) = 1/(1 + t)$ is a monotone function, as in Algorithm 1.
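The steps of Algorithm 2 can be sketched as below. This is an illustration under several assumptions that are not stated in the paper: the raw Nk counts are used as the base values z (following the prose description), α is searched on a grid with spacing `step`, np is rounded up to a whole number, and values are rounded to nine decimals before counting unique items. All function and parameter names are hypothetical.

```python
import math


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def knn_indices(points, k):
    """Indices of the k nearest other points for every point (ties broken by index)."""
    n = len(points)
    return [
        sorted((j for j in range(n) if j != i),
               key=lambda j: (squared_distance(points[i], points[j]), j))[:k]
        for i in range(n)
    ]


def disc_score(values, p):
    """Share of unique values among the ceil(n*p) smallest members of values."""
    m = math.ceil(len(values) * p)
    smallest = sorted(values)[:m]
    # Assumption: round before comparing so floating-point noise does not create fake uniques.
    return len({round(v, 9) for v in smallest}) / m


def antihub2_scores(points, k, p=0.5, step=0.1):
    n = len(points)
    knn = knn_indices(points, k)
    # Step 1: base values z (assumed here to be the raw k-occurrence counts).
    z = [0.0] * n
    for neighbors in knn:
        for j in neighbors:
            z[j] += 1.0
    # Step 2: sum of the base values over the k nearest neighbors of each point.
    y = [sum(z[j] for j in knn[i]) for i in range(n)]
    best, best_disc = list(z), 0.0
    for s in range(round(1 / step) + 1):
        alpha = s * step
        # Step 3: blend each point's own value with its neighborhood sum.
        cand = [(1 - alpha) * z[i] + alpha * y[i] for i in range(n)]
        # Steps 4 and 5: keep the blend that discriminates best among low values.
        d = disc_score(cand, p)
        if d > best_disc:
            best, best_disc = cand, d
    # Step 6: monotone function f(t) = 1 / (1 + t).
    return [1.0 / (1 + t) for t in best]
```

The grid search keeps the first α that reaches the best discrimination score, so when no blend improves on α = 0 the result coincides with the scores of Algorithm 1.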


Thus the AntiHub2 algorithm refines the outlier score. The scores show that outliers are determined with high accuracy for various
values of k. The proposed algorithm reduces the time complexity and organizes the contributions of the whole system well.
Different datasets are used and the results are taken for analysis. The datasets vary in dimensionality, which is the main focus
here. In high dimensions most data points look similar to outliers, and this method looks more closely to find the data points that
are actually outliers. The computation remains expensive when the data points have high dimensionality. The algorithm is well
suited to both low-dimensional and high-dimensional data. The ID3-based approach also gives equally good results in finding
outlier points irrespective of the dimensionality. It gives a good score similar to that of the AntiHub2 method and suits both
synthetic data and real data.

5. RESULT ANALYSIS 

Table I: Real datasets used in the experiment

Data set       n    D    SN10    Outlier %
Aloi
Churn
KDD
Mammography
NBA-ALLSTAR
Thyroid
Sick


This table shows that the mammography dataset is low-dimensional and the KDD dataset is high-dimensional. The proposed method
suits all the above cases and detects outliers with greater accuracy than the KNN algorithm. It also reduces the time complexity
since there is no need to assign labels to all data points [16]. With binary classification in KNN, a class label must be given for
every point, which increases the time complexity. An advantage is that the skewness value differs with the dimensionality, as the
dimension plays the major role here. Real datasets are used for better analysis, taking all these challenges into account, and the
outliers are obtained as follows. Table I describes the skewness and outlier percentage of the different datasets for the AntiHub2 and
ID3 approaches. With the ID3 method the skewness is found to be low and the outlier percentage high. The data points it finds as
outliers are almost the same as those found by AntiHub2.
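The skewness referred to here can be illustrated with the standardized third moment of a list of Nk values. The sketch below is an assumption about how such a value might be computed; the paper does not give its formula, and the function name `skewness` is hypothetical.

```python
def skewness(values):
    """Standardized third moment: mean((v - mean)^3) / variance^1.5."""
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    if variance == 0:
        # All counts are equal, so the distribution has no skew.
        return 0.0
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / variance ** 1.5
```

A positive value indicates a long right tail, i.e. a few hub points with very large counts and many points with small counts.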


Table II: Strong outliers for the NBA-ALLSTAR data in AntiHub1

NAME                   ID3
Kareem Abdul-Jabbar    Yes
Shawn Marion           Yes
Dimton jack            Yes
Vilowon Kae            Yes


This table shows the strong outliers detected in the NBA-ALLSTAR dataset. These points are considered noisy or unwanted data,
i.e. anomalies. They can be removed so that the remaining data points can be made available for further use. The threshold value
is fixed based on the analysis of each data point to obtain the antihub points. Finally the outlier score is determined.


Table III: Strong outliers for the NBA-ALLSTAR data in AntiHub2

NAME                   ID3
Kareem Abdul-Jabbar    Yes
Shawn Marion           Yes
Dimton jack            Yes
Vilowon Kae            Yes


This table gives the refinement of the outlier scores by the AntiHub2 algorithm. It confirms the correctness of the scores and shows
that there is no wide change in the outlier points. Some points that are misclassified as good points by AntiHub1 are correctly
classified by AntiHub2, which thus refines the outlier score. Since all the algorithms work well on the NBA-ALLSTAR dataset, it is
shown here in brief. An effective reduction of time complexity is achieved because the unsupervised learning methodology avoids
the need for class labels.


Rohini et al. 
Outlier Detection on High Dimensional Data Using RNN, 
Indian Journal of Engineering, 2016, 13(31), 98-107, 




Legend (Figures 1 to 3): KNN, Antihub1, Antihub2, ID3

Figure 1 Outlier for NBA-ALL STAR dataset
Figure 2 Outlier for ALOI Dataset
Figure 3 Outlier for WILT Dataset
These figures describe outlier detection for the compared methods, among which the AntiHub2 method gives the better result. Note
that skewness is reduced for the AntiHub2 and ID3 methods, so their accuracy is correspondingly high. One of the challenges is
fixing the threshold value, which is done on the basis of the analysis. Here k is the number of neighbors and AUC is the area under
the curve; f(t) is used to find the outlier score. Anomalies are identified with a higher accuracy rate than with the KNN, ABOD and
ODIN methods.
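For readers unfamiliar with AUC, the sketch below computes the area under the ROC curve from outlier scores and ground-truth labels. It assumes the standard pairwise definition (the fraction of outlier and normal pairs ranked correctly, with ties counted as one half); the function name `roc_auc` and the example values are illustrative and are not taken from the experiments.

```python
def roc_auc(scores, is_outlier):
    """Area under the ROC curve: chance that a true outlier outscores a normal point."""
    positives = [s for s, flag in zip(scores, is_outlier) if flag]
    negatives = [s for s, flag in zip(scores, is_outlier) if not flag]
    if not positives or not negatives:
        raise ValueError("need at least one outlier and one normal point")
    wins = 0.0
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                # A tie counts as half a correctly ranked pair.
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

An AUC of 1.0 means every true outlier received a higher score than every normal point, while 0.5 corresponds to a ranking no better than chance.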


6. CONCLUSION AND FUTURE WORK 


This section concludes with a discussion of the results obtained and of future enhancements. Outlier detection is one of the most
important tasks in data mining, and the proposed algorithm performs it effectively. The AntiHub2 algorithm is more efficient than
the KNN and AntiHub1 algorithms. With AntiHub1 some points merely seem to be good outliers because of the curse of
dimensionality; AntiHub2 overcomes this successfully. Because class labels are not present in unsupervised learning, some accuracy
may be lost, but the time complexity is lower than that of supervised learning methodologies. The work was tested on all the real
datasets and the experiments gave consistently high accuracy. High-dimensional data makes all points look like good outliers. To
overcome this challenge the hubness phenomenon was introduced, and it is well suited to the proposed algorithm. The existence of
hubs and antihubs in high-dimensional data is relevant to machine-learning techniques from various families: supervised,
semi-supervised and unsupervised. Based on the analysis, the AntiHub method for unsupervised outlier detection was formulated,
its properties were discussed, and a derived method that improves discrimination between scores was proposed. The main hope is
that this article clarifies the picture of the interplay between the types of outliers and the properties of data, filling a gap in
understanding which may have so far hindered the widespread use of reverse neighbor methods in unsupervised outlier detection.
In future this technique can be applied to image mining.